The challenge of realistic music generation: modelling raw audio at scale

Neural Information Processing Systems

Realistic music generation is a challenging task. When building generative models of music that are learnt from data, typically high-level representations such as scores or MIDI are used that abstract away the idiosyncrasies of a particular performance. But these nuances are very important for our perception of musicality and realism, so in this work we embark on modelling music in the raw audio domain. It has been shown that autoregressive models excel at generating raw audio waveforms of speech, but when applied to music, we find them biased towards capturing local signal structure at the expense of modelling long-range correlations. This is problematic because music exhibits structure at many different timescales. In this work, we explore autoregressive discrete autoencoders (ADAs) as a means to enable autoregressive models to capture long-range correlations in waveforms. We find that they allow us to unconditionally generate piano music directly in the raw audio domain, which shows stylistic consistency across tens of seconds.
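The central idea of a discrete autoencoder, compressing raw waveform frames into a much shorter sequence of discrete codes that an autoregressive model can then learn over, can be sketched as follows. This is only a minimal illustration: the codebook here is random rather than learned, and the frame and codebook sizes are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
waveform = rng.standard_normal(1024)     # stand-in for raw audio samples
frames = waveform.reshape(-1, 16)        # 64 frames of 16 samples each
codebook = rng.standard_normal((8, 16))  # 8 code vectors (random here, learned in practice)

# Quantize: map each frame to the index of its nearest codebook vector.
dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
codes = dists.argmin(axis=1)  # discrete code sequence, 16x shorter than the waveform

print(codes.shape)  # (64,)
```

An autoregressive model over `codes` then only has to span 64 steps instead of 1024 raw samples, which is how the discretization helps capture longer-range structure.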



SwissGPC v1.0 -- The Swiss German Podcasts Corpus

Stucki, Samuel, Cieliebak, Mark, Deriu, Jan

arXiv.org Artificial Intelligence

We present SwissGPC v1.0, the first mid-to-large-scale corpus of spontaneous Swiss German speech, developed to support research in ASR, TTS, dialect identification, and related fields. The dataset consists of links to talk shows and podcasts hosted on Schweizer Radio und Fernsehen and YouTube, which contain approximately 5400 hours of raw audio. After segmentation and weak annotation, nearly 5000 hours of speech were retained, covering the seven major Swiss German dialect regions alongside Standard German. We describe the corpus construction methodology, including an automated annotation pipeline, and provide statistics on dialect distribution, token counts, and segmentation characteristics. Unlike existing Swiss German speech corpora, which primarily feature controlled speech, this corpus captures natural, spontaneous conversations, making it a valuable resource for real-world speech applications.


Reviews: The challenge of realistic music generation: modelling raw audio at scale

Neural Information Processing Systems

The authors claim that there is no suitable metric to evaluate the quality of the generated audio, which is plausible, so they listened to the audio and evaluated it themselves. The only shortcoming here is that no systematic, blind listening test has been conducted yet. The authors themselves might be biased, and thus the capabilities of the proposed approach cannot be considered fully proven from a scientific perspective. However, a link to the audio is provided so that readers can convince themselves of the proposed method.

Minor comments:
- "nats per timestep": should be defined
- p. 3, l.


Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles

López, Fernando, Luque, Jordi, Segura, Carlos, Gómez, Pablo

arXiv.org Artificial Intelligence

Voice-based interfaces rely on a wake-up word mechanism to initiate communication with devices. However, achieving robust, energy-efficient, and fast detection remains a challenge. This paper addresses these real production needs by enhancing data with temporal alignments and using a two-phase, multi-resolution detection scheme. It employs two models: a lightweight on-device model for real-time processing of the audio stream, and a server-side verification model, an ensemble of heterogeneous architectures that refines detection. This scheme allows the optimization of two operating points. To protect privacy, audio features are sent to the cloud instead of raw audio. The study investigated different parametric configurations for feature extraction to select one for on-device detection and another for the verification model. Furthermore, thirteen different audio classifiers were compared in terms of performance and inference time. The proposed ensemble outperforms our strongest single classifier in every noise condition.
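The two-stage idea described in the abstract can be sketched as a cheap on-device score gating a more expensive server-side ensemble, with the two thresholds giving two independently tunable operating points. The models below are random linear scorers standing in for the real networks; the feature dimension, names, and thresholds are illustrative assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(1)
N_FEAT = 40  # e.g. 40 filterbank features per window (assumed)

# Stand-ins for trained models: one lightweight on-device scorer and
# an ensemble of heterogeneous server-side scorers.
device_w = rng.standard_normal(N_FEAT)
server_ws = [rng.standard_normal(N_FEAT) for _ in range(3)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def detect(features, t_device=0.5, t_server=0.5):
    """Two operating points: t_device trades recall against battery/traffic,
    t_server trades false accepts against false rejects."""
    if sigmoid(features @ device_w) < t_device:
        return False  # rejected on-device; nothing leaves the device
    # Privacy: only the feature vector, not raw audio, goes to the server.
    scores = [sigmoid(features @ w) for w in server_ws]
    return float(np.mean(scores)) >= t_server  # ensemble verification

result = detect(rng.standard_normal(N_FEAT))
```

Gating on the device score means the ensemble only runs on candidate detections, which is what makes the scheme both energy-efficient and refinable server-side.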


Jukebox

#artificialintelligence

This has led to impressive results like producing Bach chorales, polyphonic music with multiple instruments, as well as minute-long musical pieces. But symbolic generators have limitations: they cannot capture human voices or many of the more subtle timbres, dynamics, and expressivity that are essential to music. A different approach is to model music directly as raw audio, which at CD quality means millions of timesteps per minute. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game. Thus, to learn the high-level semantics of music, a model would have to deal with extremely long-range dependencies.
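The scale gap the excerpt alludes to is easy to compute: CD-quality audio has 44,100 samples per second, so even a single minute dwarfs the 1,000-timestep context of GPT-2 mentioned above.

```python
SAMPLE_RATE = 44_100   # CD-quality audio, samples per second
GPT2_CONTEXT = 1_000   # timesteps, as stated in the excerpt

one_minute = SAMPLE_RATE * 60
print(one_minute)                  # 2646000 timesteps in one minute of raw audio
print(one_minute // GPT2_CONTEXT)  # 2646: roughly 2,600x GPT-2's context
```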


A ResNet attention model for classifying mosquitoes from wing-beating sounds - Scientific Reports

#artificialintelligence

Mosquitoes are vectors of numerous deadly diseases, and the mosquito classification task is vital for their control programs. To ease manual, time-consuming classification tasks, numerous image-based machine-learning (ML) models have been developed to classify different mosquito species. Mosquito wing-beating sounds can serve as a unique signal for classification and can be adopted easily in field applications. The current study aims to develop a deep neural network model to identify six mosquito species of three different genera based on their wing-beating sounds. While existing models focused on raw audio, we developed a comprehensive pre-processing step to convert raw audio into more informative Mel-spectrograms, resulting in more robust, noise-free extracted features. Our model, named 'Wing-beating Network' or 'WbNet', combines the state-of-the-art residual neural network (ResNet) model as a baseline with a self-attention mechanism and data augmentation, and outperformed other existing models. WbNet achieved the highest performance of 89.9% and 98.9% on the WINGBEATS and ABUZZ data respectively. For species of the Aedes and Culex genera, our model achieved 100% precision, recall and F1-scores, whereas for Anopheles, WbNet reached above 95%. We also compared the two existing wing-beating datasets, WINGBEATS and ABUZZ, and found that our model does not need sophisticated audio devices, performing better on ABUZZ audio captured on ordinary mobile devices. Overall, our model has the potential to serve in mosquito monitoring and prevalence studies in eradication programs, along with potential application to classification of insect pests or other sound-based classification tasks.
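The raw-audio-to-Mel-spectrogram pre-processing the abstract credits for WbNet's robustness can be sketched with plain NumPy. The FFT size, hop length, and 20-band triangular filter bank below are deliberate simplifications for illustration, not the paper's actual settings, and the pure sine input stands in for a wing-beat recording.

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)  # stand-in for a wing-beat recording

# Short-time Fourier transform: windowed, overlapping frames.
n_fft, hop = 256, 128
frames = np.stack([audio[i:i + n_fft] for i in range(0, len(audio) - n_fft, hop)])
window = np.hanning(n_fft)
power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2  # linear power spectrogram

# Mel scale: m = 2595 * log10(1 + f / 700), and its inverse.
def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)

# Simplified triangular filter bank with mel-spaced center frequencies.
n_mels = 20
mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
fb = np.zeros((n_mels, n_fft // 2 + 1))
for i in range(n_mels):
    l, c, r = bins[i], bins[i + 1], bins[i + 2]
    fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
    fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

mel_spec = np.log(power @ fb.T + 1e-10)  # log-Mel spectrogram, frames x n_mels
print(mel_spec.shape)
```

Feeding a 2-D `mel_spec` to a convolutional model like ResNet, rather than the raw 1-D waveform, is the substitution the abstract describes.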


(Deep) House: Making AI-Generated House Music

#artificialintelligence

People have been trying to make machine-generated music for a long time. Some of the earliest examples were musicians punching holes in piano rolls to create complex melodies unplayable by humans (see Conlon Nancarrow, 1947). More recently, it has looked like electronic music in the form of MIDI files, where songs can be symbolically represented by specifying various attributes: the instrument, pitch, duration, and timing. But what does it look like for AI to run the whole generation process? This article explores generative audio techniques, training OpenAI's Jukebox on hours of house music.
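The symbolic representation described above amounts to a list of attribute tuples. A hypothetical minimal encoding (illustrative only, not an actual MIDI file format) might look like:

```python
from dataclasses import dataclass

@dataclass
class Note:
    instrument: str   # e.g. a General MIDI program name
    pitch: int        # MIDI note number; 60 is middle C
    start: float      # onset time in seconds
    duration: float   # length in seconds

# A C-major arpeggio as a symbolic "score" (illustrative, not a real song)
melody = [
    Note("piano", 60, 0.0, 0.5),
    Note("piano", 64, 0.5, 0.5),
    Note("piano", 67, 1.0, 1.0),
]
total_length = max(n.start + n.duration for n in melody)  # 2.0 seconds
```

Three events describe two seconds of music here, which is exactly the compactness that symbolic generators exploit and that raw-audio models like Jukebox give up.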


Artificial Intelligence and music creation: What is OpenAI's Jukebox? Purple Sneakers

#artificialintelligence

The future is now, people. Not only do we have pandemic-proof rave suits being designed, we also might be on the precipice of having music released that was made with Artificial Intelligence, thanks to the latest development from OpenAI. Aptly titled 'Jukebox', the new model is able to generate genre-specific music. According to OpenAI's website, Jukebox is "a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles." Using over 1.6 million songs as its dataset, Jukebox is able to take a song provided as input and generate a sample produced from scratch in specific genres as output.